- Assignment 1
- Due: 09/23/2020 11:55pm (EST)
- Assignment 2 is available now
- Due: 10/02/2020 11:55pm (EST)
- Office hours
- Monday 9 - 10am (EST)
- Wednesday 12:30 - 1:30pm (EST)
- Office hour notes
- Available on NYU Classes under the “Resources” tab
09/22/2020
“Get ten bagels when you arrive at the bakery. If you see muffins, get one.”
How many bagels will you buy?
A. 10
B. 1
“Get ten bagels when you arrive at the bakery. If you see muffins, get one.”
has_Muffin <- TRUE

shopping_at <- function(location) {
  if (location == "bakery") {
    if (has_Muffin == TRUE) {
      bagel <- 1
    } else {
      bagel <- 10
    }
  }
  return(bagel)
}

shopping_at(location = "bakery")
## [1] 1
has_Muffin <- TRUE

shopping_at <- function(human = TRUE, location) {
  if (location == "bakery") {
    if (human == TRUE) {
      if (has_Muffin == TRUE) {
        bagel <- 10
        muffin <- 1
      } else {
        bagel <- 10
      }
    } else {
      if (has_Muffin == TRUE) {
        bagel <- 1
      } else {
        bagel <- 10
      }
    }
  }
  return(bagel)
}

shopping_at(human = TRUE, location = "bakery")  # 10
shopping_at(human = FALSE, location = "bakery") # 1
To get things right, spell out every step, like this:
When you arrive at the bakery:
If there are muffins, buy 1 muffin and 10 bagels. Checkout.
If there is no muffin, buy 10 bagels. Checkout.
How we communicate:
Human: totally fine! don’t worry I got it!
Computer:
There are many types of regression analysis.
In this course, we will focus on linear regression.
\[ Y_i = \beta_0 + \beta_1 \times X_i + \varepsilon_i \] where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon_i\) is the error term.
Let’s create 100 data points with two variables: X and Y, which are highly correlated.
\[ Y_i = X_i + \varepsilon_i \]
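The simulation code itself is not shown on the slides; a sketch that produces data of this shape (the seed, the mean of X, and the noise SD are assumptions):

```r
# Sketch of the simulation (seed, mean, and noise SD are assumptions;
# the original values are not shown)
set.seed(1)
X <- rnorm(100, mean = 19, sd = 1)        # 100 draws of X
Y <- X + rnorm(100, mean = 0, sd = 0.5)   # Y = X + noise, so cor(X, Y) is high
dat_sim <- data.frame(X = X, Y = Y, Diff. = Y - X)
head(dat_sim)
```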
##          X        Y        Diff.
## 1 20.38222 21.37346  0.99123880
## 2 20.51409 20.29707 -0.21702244
## 3 17.89188 17.32735 -0.56452933
## 4 19.27714 19.31970  0.04256237
## 5 18.61547 19.46676  0.85129379
## 6 18.24215 18.33422  0.09207247
# Requires the ggplot2 and plotly packages
p <- ggplot(dat_sim, aes(x = X, y = Y)) +
  geom_point() +
  theme_minimal()
ggplotly(p)
paste("Correlation Coefficient: ",
round(cor(x = dat_sim$X, y = dat_sim$Y), digits = 3),
sep = "")
## [1] "Correlation Coefficient: 0.9"
Let’s fit a linear regression model. A linear regression defines a line:
\[ \hat{Y}_i = \beta_0 + \beta_1 X_i \]
The regression line is the line that minimizes the sum of squared distances (errors) between the observed \(Y_i\) and the line. In this case:
\[ (\beta_0, \beta_1) = \operatorname*{arg\,min} \sum_{i=1}^{100} (Y_i - \hat{Y}_i)^2 = \operatorname*{arg\,min} \sum_{i=1}^{100} (Y_i - (\beta_0 + \beta_1 X_i))^2 \]
\[ \begin{align} \mathcal{Q}(\beta_0, \beta_1) & = \sum_{i=1}^{n} (Y_i - \hat {Y_i})^2 = \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2 \\ & = \sum_{i=1}^n (Y_i - \beta_0 -\beta_1 X_i)^2 \\ & = \sum_{i=1}^n Y_{i}^2 + n \beta_0^2 + \beta_1^2 \sum_{i=1}^n X_i^2 - 2 \beta_0 \sum_{i=1}^n Y_i - 2 \beta_1 \sum_{i=1}^n Y_i X_i + 2 \beta_0 \beta_1 \sum_{i=1}^n X_i \\ \end{align} \]
\[ \mathcal{Q}(\beta_0, \beta_1) = \sum_{i=1}^n Y_{i}^2 + n \beta_0^2 + \beta_1^2 \sum_{i=1}^n X_i^2 - 2 \beta_0 \sum_{i=1}^n Y_i - 2 \beta_1 \sum_{i=1}^n Y_i X_i + 2 \beta_0 \beta_1 \sum_{i=1}^n X_i \]
\[ \frac{\partial \mathcal{Q}}{\partial \beta_0} = 2n\beta_0 - 2n\overline{Y} + 2n\beta_1 \overline{X} \]
\[ \frac{\partial \mathcal{Q}}{\partial \beta_1} = 2\beta_1 \sum_{i=1}^n X_i^2 + 2n\beta_0 \overline{X} - 2\sum_{i=1}^n Y_i X_i \]
where \(\overline{Y}\) is the mean of \(Y\), and \(\overline{X}\) is the mean of \(X\).
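The step from \(\mathcal{Q}\) to these derivatives is the chain rule applied term by term; for the intercept, for example:

\[ \frac{\partial \mathcal{Q}}{\partial \beta_0} = \sum_{i=1}^n -2(Y_i - \beta_0 - \beta_1 X_i) = 2n\beta_0 - 2\sum_{i=1}^n Y_i + 2\beta_1 \sum_{i=1}^n X_i = 2n\beta_0 - 2n\overline{Y} + 2n\beta_1 \overline{X} \]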
\(\frac{\partial \mathcal{Q}}{\partial \beta_0}\) is the rate of change of \(\mathcal{Q}\) as the intercept (\(\beta_0\)) varies while the slope (\(\beta_1\)) is held constant.
\(\frac{\partial \mathcal{Q}}{\partial \beta_1}\) is the rate of change of \(\mathcal{Q}\) as the slope (\(\beta_1\)) varies while the intercept (\(\beta_0\)) is held constant.
To confirm that the critical point is a minimum, i.e., the Ordinary Least Squares (OLS) solution, we check that the second derivatives are positive.
\[ \frac{\partial \mathcal{Q}}{\partial \beta_0} = 2n \beta_0 - 2n \overline{Y} + 2n \beta_1 \overline{X} \]
\[ \frac{\partial^2 \mathcal{Q}}{\partial \beta_0^2} = 2n \]
\[ \frac{\partial \mathcal{Q}}{\partial \beta_1} = 2 \beta_1 \sum_{i=1}^n X_i^2 + 2n \beta_0 \overline{X} - 2 \sum_{i=1}^n Y_i X_i \]
\[ \frac{\partial^2 \mathcal{Q}}{\partial \beta_1^2} = 2 \sum_{i=1}^n X_i^2 \]
\[ \frac{\partial^2 \mathcal{Q}}{\partial \beta_0 \partial \beta_1} = 2 \sum_{i=1}^n X_i = 2n \overline{X} \]
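Strictly, a minimum also requires the determinant of the Hessian to be positive, which holds whenever the \(X_i\) are not all identical:

\[ \frac{\partial^2 \mathcal{Q}}{\partial \beta_0^2} \cdot \frac{\partial^2 \mathcal{Q}}{\partial \beta_1^2} - \left( \frac{\partial^2 \mathcal{Q}}{\partial \beta_0 \partial \beta_1} \right)^2 = 4n \sum_{i=1}^n X_i^2 - 4n^2 \overline{X}^2 = 4n \sum_{i=1}^n (X_i - \overline{X})^2 > 0 \]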
Then, set \(\frac{\partial \mathcal{Q}}{\partial \beta_0}\) and \(\frac{\partial \mathcal{Q}}{\partial \beta_1}\) equal to zero and solve:
\[ \frac{\partial \mathcal{Q}}{\partial \beta_0} = 2n\beta_0 - 2n\overline{Y} + 2n\beta_1 \overline{X} = 0 \]
\[ \frac{\partial \mathcal{Q}}{\partial \beta_1} = 2\beta_1 \sum_{i=1}^n X_i^2 + 2n\beta_0 \overline{X} - 2\sum_{i=1}^n Y_i X_i = 0 \]
Solving gives:
\[ \beta_0 = \overline{Y} - \beta_1 \overline{X} \]
\[ \beta_1 = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^n (X_i - \overline{X})^2} = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{n \times \mathrm{Var}(X)} = \frac{\mathrm{Corr}(X, Y) \times \mathrm{SD}(Y)}{\mathrm{SD}(X)} \] The higher the correlation, the steeper the slope.
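These closed-form solutions can be checked numerically against R's `lm()` (a quick sketch with made-up data, not the course example):

```r
# Verify the closed-form OLS formulas against lm() on simulated data
set.seed(2)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
# b1 also equals Corr(x, y) * SD(y) / SD(x)
fit <- coef(lm(y ~ x))
c(b0, b1)  # matches fit to numerical precision
```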
2017 Cherry Blossom Run Data
Available on NYU Classes under the “Resources - Tong’s Lab - Lab 3”
Number of rows: 19,961
Variables
bib: number on the runner’s bib (ID)
name: initials
sex: biological sex
age: age of the runner
city: home city
net_sec: time record in seconds (accounting for staggered starting time)
clock_sec: time record in seconds (based on the clock)
pace_sec: average time per mile, in seconds
event: either the “10 Mile” race or the “5K”
# Set working directory
# Load the data set
dat <- read.csv("../../../data/run17.csv")
Dimensions: number of rows, number of columns
nrow(dat)
## [1] 19961
\(H_0\): For the 10 Mile race, the mean net finish time of male runners is the same as that of female runners.
\(H_1\): For the 10 Mile race, the mean net finish time of male runners is not the same as that of female runners.
dat_10m <- dat[dat$event == "10 Mile", ]
dat_10m$sex <- ifelse(dat_10m$sex == "M", 1, 0)
To examine whether equal (pooled) variances are plausible, run Levene's test.
# leveneTest() is provided by the "car" package
leveneTest(net_sec ~ factor(sex), data = dat_10m)
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  34.267 4.891e-09 ***
##       17440                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t.test(net_sec ~ sex, data = dat_10m, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  net_sec by sex
## t = 40.434, df = 14287, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  582.1310 641.4466
## sample estimates:
## mean in group 0 mean in group 1 
##        6113.618        5501.829
summary(lm(net_sec ~ sex, data = dat_10m))
## 
## Call:
## lm(formula = net_sec ~ sex, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2896.62  -687.62   -38.62   640.33  2896.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6113.618      9.454  646.69   <2e-16 ***
## sex         -611.789     14.927  -40.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 966.2 on 17440 degrees of freedom
## Multiple R-squared:  0.08786, Adjusted R-squared:  0.0878 
## F-statistic:  1680 on 1 and 17440 DF,  p-value: < 2.2e-16
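Note how the regression reproduces the t-test comparison: the intercept (6113.618) is the mean for group 0 (female), and the slope (-611.789) is the difference between the group means. This holds for any 0/1 predictor; a generic sketch (simulated data, not the course data set):

```r
# With a 0/1 predictor, lm()'s intercept equals the group-0 mean and
# the slope equals the difference between the two group means
set.seed(4)
sex <- rep(c(0, 1), each = 100)
y   <- ifelse(sex == 1, 5500, 6100) + rnorm(200, sd = 900)
fit <- coef(lm(y ~ sex))
unname(fit["(Intercept)"]) - mean(y[sex == 0])                 # ~0
unname(fit["sex"]) - (mean(y[sex == 1]) - mean(y[sex == 0]))   # ~0
```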
plot(x = dat_10m$sex, y = dat_10m$net_sec, type = "p")
abline(a = 6113.618, b = -611.789, lwd = 2)
On average, the older, the slower.
summary(lm2 <- lm(net_sec ~ age, data = dat_10m))
## 
## Call:
## lm(formula = net_sec ~ age, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5388.0324    27.2989  197.37   <2e-16 ***
## age           12.9782     0.7087   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886, Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16
plot(x = dat_10m$age, y = dat_10m$net_sec, type = "p", col = "grey")
abline(a = 5388.0324, b = 12.9782, lwd = 2)
dat_10m$age0 <- dat_10m$age - mean(dat_10m$age)
summary(lm3 <- lm(net_sec ~ age0, data = dat_10m))
## 
## Call:
## lm(formula = net_sec ~ age0, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5868.2292     7.5877  773.38   <2e-16 ***
## age0          12.9782     0.7087   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886, Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16
plot(x = dat_10m$age0, y = dat_10m$net_sec, type = "p", col = "grey")
abline(a = 5868.2292, b = 12.9782, lwd = 2)
abline(v = 0, lty = "dashed", col = "red")
dat_10m$age1 <- dat_10m$age0 / sd(dat_10m$age0)
summary(lm4 <- lm(net_sec ~ age1, data = dat_10m))
## 
## Call:
## lm(formula = net_sec ~ age1, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5868.229      7.588  773.38   <2e-16 ***
## age1         138.950      7.588   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886, Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16
plot(x = dat_10m$age1, y = dat_10m$net_sec, type = "p", col = "grey", xlab = "SD")
abline(a = 5868.229, b = 138.950, lwd = 2)
abline(v = 0, lty = "dashed", col = "red")
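Standardizing a predictor simply rescales its slope by the predictor's standard deviation: here 12.9782 × SD(age) ≈ 138.95, while the intercept and all test statistics are unchanged. A generic check (simulated data, not the course data set):

```r
# Standardizing a predictor multiplies its slope by the predictor's SD
set.seed(3)
age  <- sample(20:60, 200, replace = TRUE)
time <- 5000 + 13 * age + rnorm(200, sd = 900)
b_raw <- coef(lm(time ~ age))["age"]
z     <- (age - mean(age)) / sd(age)      # centered and scaled age
b_std <- coef(lm(time ~ z))["z"]
unname(b_std) - unname(b_raw) * sd(age)   # ~0
```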